```python
# ruff: noqa: N803, N806
from typing import Callable

import branca
import geopandas as gpd
import odc.geo.xr  # noqa: F401
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.optimize
import scipy.spatial
import xarray as xr
from numpy.typing import ArrayLike
from rich.progress import track
from scipy.optimize import curve_fit
import seaborn as sns
import matplotlib as mpl
import folium
import branca as bc
from matplotlib import colormaps as cm
from matplotlib import colors
import branca.colormap as bcm

# Import cupy (CUDA accelerated numpy) if available
try:
    import cupy as cp
    import cupy_xarray  # noqa: F401
    import cupyx.scipy.spatial
except ImportError:
    cp = None

mpl.rcParams.update(mpl.rcParamsDefault)

# Set Seaborn color palette
cmap = "Set2"
sns.set_palette(cmap)

# Update Matplotlib rcParams
mpl.rcParams.update(
    {
        # 'font.family': 'Consolas',  # Font family for all text
        "figure.dpi": 200,  # High resolution for figures
        # Axes settings
        "axes.titlesize": 12,  # Font size for the axes title
        "axes.titleweight": "normal",  # Weight for the axes title
        "axes.labelsize": 10,  # Font size for the axes labels
        # Tick settings
        "xtick.direction": "in",  # Ticks pointing inwards
        "ytick.direction": "in",  # Ticks pointing inwards
        "xtick.minor.visible": True,  # Show minor ticks on x-axis
        "ytick.minor.visible": True,  # Show minor ticks on y-axis
        # Grid settings
        "axes.grid": True,  # Enable grid by default
        "axes.grid.which": "both",  # Show both major and minor grids
        "grid.linestyle": ":",  # Dotted grid lines
        "grid.alpha": 0.3,  # Transparency for major grid lines
        "grid.linewidth": 0.6,  # Line width for major grid lines
    }
)
plt.rcParams["ytick.right"] = plt.rcParams["xtick.top"] = True
plt.rcParams["ytick.left"] = plt.rcParams["ytick.labelleft"] = True


# Convert a GeoDataFrame of gridded points to an xarray DataArray
def _sparse_to_xarray_direct(gdf: gpd.GeoDataFrame, col: str):
    # Ensure we have numeric coordinates
    gdf["x"] = gdf.geometry.x
    gdf["y"] = gdf.geometry.y
    # Create pivot table
    pivoted = gdf.pivot_table(index="x", columns="y", values=col)
    # Create xarray DataArray
    da = xr.DataArray(
        pivoted.to_numpy()[:, ::-1],
        dims=["x", "y"],
        coords={"x": pivoted.index, "y": pivoted.columns[::-1]},
    ).transpose("y", "x")
    da = da.odc.assign_crs(gdf.crs)
    return da
```
Introduction
Arsenic contamination of groundwater has created a public health crisis in Bangladesh. It affects millions of people who rely on the contaminated water for drinking and domestic purposes. The extent of this contamination varies widely across the country, with some areas having dangerously high levels (Hossain 2006). The contamination originates from naturally occurring arsenic in the aquifer sediments, which is mobilized into groundwater through geochemical processes. The most widely supported explanation is the reductive dissolution of iron oxyhydroxides under anoxic conditions (Nickson et al. 1998), though some studies suggest that irrigation-induced oxidation may also play a role in certain areas (Hossain 2006). Robust methods are needed to predict where arsenic concentrations will exceed safety thresholds, given the widespread reliance on groundwater and the health risks of arsenic exposure. Geostatistical indicator techniques provide a powerful approach to this challenge, allowing the assessment and mapping of the risk of arsenic contamination above critical levels such as 10 µg/l and 50 µg/l. Specifically, indicator simulation and kriging can be used to generate maps that provide valuable information for risk assessment and mitigation strategies.
Dataset and area of interest
This analysis uses a dataset of concentrations of different species measured in irregularly spaced groundwater wells in Bangladesh. The data were collected in 1998 and 1999, and our analysis uses only the arsenic concentrations. In addition, a simple boundary shape of Bangladesh is used for this analysis. To simplify the analysis, values below the detection limit are replaced by 5 or 6 \(\mu g/l\), depending on the detector. The dataset contains unique sample IDs and unique sample field IDs, but reveals 113 coordinate pairs that occur two or more times. These duplicate coordinate pairs are removed because they would result in a singular linear equation system for the kriging interpolation. The data are stored in the WGS84 coordinate reference system; for the analysis, the dataset is reprojected to the local EPSG:9678 coordinate reference system.
```python
waterquality = pd.read_csv(
    "data/NationalSurveyData.csv",
    sep=",",
    skipinitialspace=True,
    skiprows=(0, 1, 2, 3, 5, 6),
)
# We are only interested in the As column (and lat+lon)
waterquality = waterquality[["SAMPLE_ID", "As", "LAT_DEG", "LONG_DEG"]]
# Convert the As values to numeric
waterquality["As"] = waterquality["As"].str.replace(r"^<", "", regex=True).astype(float)
# Convert to a GeoDataFrame
waterquality = gpd.GeoDataFrame(
    waterquality,
    geometry=gpd.points_from_xy(waterquality["LONG_DEG"], waterquality["LAT_DEG"]),
).set_crs(epsg=4326)
# Normalize the geometries and drop duplicate points
waterquality["geometry"] = waterquality.normalize()
waterquality = waterquality.drop_duplicates(subset="geometry")
# Convert to a better fitting CRS
waterquality = waterquality.to_crs(epsg=9678)

surveyarea = gpd.read_file("data/borders.shp")
surveyarea = surveyarea.to_crs(epsg=9678)
```
The raw data can now be displayed in Figure 1. The map displays the distribution of arsenic concentration in Bangladesh, with blue dots representing areas of low arsenic levels and red dots indicating regions with higher contamination. The blue dots are widespread across the northern, western, and parts of the eastern regions, while the red dots are concentrated mainly in the central and southeastern areas, particularly around the Ganges Delta.
Figure 1: Arsenic concentration in Bangladesh
Methods
We assume to have a set of randomly sampled observations at the positions \(\mathbf{x_p}\) with assigned observations of a certain attribute \(Z(\mathbf{x_p})\). These attribute values may be distributed according to their position. In order to interpolate the attribute field \(Z(\mathbf{x})\), the spatial variability of \(Z(\mathbf{x})\) has to be described; it can then be used to perform the covariance-based interpolation.
Variogram
Empirical Variogram
These attribute values will have certain stochastic properties. Furthermore, these stochastic properties are connected to the spatial distribution of the points, or they are not. In order to describe this, we analyze the spatial variability of our attribute \(Z(\mathbf{x_p})\) using a variogram. The variogram describes the (semi-)variance of our measured values according to their distance to other observations inside our sample.
The semivariance of the variogram is calculated as follows:

\[
\gamma(h) = \frac{1}{2 N(h)} \sum_{i=1}^{N(h)} \left[ Z(x_i) - Z(x_i + h) \right]^2
\]

The variogram thus measures half the mean squared difference of the data values of \(Z\), where \(h\) is the distance between points in the lag class, \(N(h)\) is the number of point pairs separated by the distance \(h\) (number of pairs per lag class), \(x_i\) is the position of the i-th point with attribute value \(Z(x_i)\), and \(x_i\) is an element of \(\mathbf{x_p}\) (Benndorf and Wiesbaden 2023).
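As a quick sanity check, the half mean squared difference can be computed by hand for a tiny hypothetical dataset (the three points and values below are illustrative, not from the survey):

```python
import numpy as np

# Three points on a line with unit spacing and values z = [0, 1, 0]
x = np.array([0.0, 1.0, 2.0])
z = np.array([0.0, 1.0, 0.0])


def gamma_hat(h):
    # Collect the squared differences of all pairs separated by exactly the lag h
    sq_diffs = [
        (z[i] - z[j]) ** 2
        for i in range(len(x))
        for j in range(i + 1, len(x))
        if abs(x[i] - x[j]) == h
    ]
    # Half the mean squared difference over the N(h) pairs
    return sum(sq_diffs) / (2 * len(sq_diffs))


print(gamma_hat(1.0))  # 2 pairs, each with squared difference 1 -> 0.5
print(gamma_hat(2.0))  # 1 pair with squared difference 0 -> 0.0
```

At lag 1 the pairs (0, 1) and (1, 0) contribute \(1/(2 \cdot 2) \cdot (1 + 1) = 0.5\), matching the formula above.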
Theoretical Variogram
The semivariogram yields an empirical description of the spatial variability using lag classes for \(h\). In order to derive a continuous semivariance, a theoretical model is fitted to the discrete data points in our variogram. Different variogram models exist for different purposes and data characteristics; the most widely used are the spherical, the exponential, and the Gaussian model.
Based on our study area and the measured variable, the arsenic concentration, we choose the spherical model for further analysis. It is well suited for observations of concentrations or contamination (Benndorf and Wiesbaden 2023).
As illustrated in Figure 2, the variogram is generally described by three parameters: the sill, the maximum semivariance the distribution reaches; the nugget, the offset between zero and the lowest semivariance just above \(h=0\); and the range, the \(h\) value beyond which the semivariance becomes constant and no longer increases.
The formula for the spherical variogram model is given by:
\[
\gamma(h) =
\begin{cases}
0 & \text{if } h = 0 \\
C_0 + C_1 \left[ \frac{3h}{2a} - \frac{1}{2} \left( \frac{h}{a} \right)^3 \right] & \text{if } 0 < h \leq a \\
C_0 + C_1 & \text{if } h > a
\end{cases}
\]
where:

- \(h\) is the lag distance,
- \(C_0\) is the nugget,
- \(C_1\) is the sill minus the nugget (note: the sill is \(C = C_0 + C_1\)),
- \(a\) is the range.
The semivariance describes the variability in our dataset, but for the interpolation tasks we need a measure of the covariance between our data points. The semivariance is directly related to the covariance by the term:
\[Cov(h) = Var(Z) - \gamma(h)\]
Because the sill in our variogram is \(Var(Z)\), the equation can easily be solved. Alternatively, the covariance can be calculated directly from the parameters \(C_0, C_1, a\). The covariance derived from the semivariance is:
\[
Cov(h) =
\begin{cases}
C_0 + C_1 & \text{if } h = 0 \\
C_1 - C_1 \left[ \frac{3h}{2a} - \frac{1}{2} \left( \frac{h}{a} \right)^3 \right] & \text{if } 0 < h \leq a \\
0 & \text{if } h > a
\end{cases}
\]
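The two piecewise models above can be checked numerically. The following minimal sketch uses hypothetical parameters (`c0`, `c1`, `a` are illustrative values, not the fitted parameters of this report) and verifies that \(Cov(h) = Var(Z) - \gamma(h)\) reproduces the cases:

```python
import numpy as np

c0, c1, a = 1.0, 4.0, 10.0  # nugget, partial sill, range (hypothetical)


def gamma_sph(h):
    # Spherical semivariogram with the three cases from the formula above
    h = np.asarray(h, dtype=float)
    out = np.where(h <= a, c0 + c1 * (3 * h / (2 * a) - 0.5 * (h / a) ** 3), c0 + c1)
    return np.where(h == 0, 0.0, out)


def cov_sph(h):
    # Cov(h) = Var(Z) - gamma(h), with Var(Z) = sill = c0 + c1
    return (c0 + c1) - gamma_sph(h)


h = np.array([0.0, 5.0, 10.0, 20.0])
print(gamma_sph(h))  # gamma at those lags: 0, 3.75, 5, 5
print(cov_sph(h))    # covariance at those lags: 5, 1.25, 0, 0
```

At \(h = 0\) the covariance equals the sill \(C_0 + C_1\), and beyond the range it drops to zero, matching the case distinction above.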
Kriging
Kriging is an interpolation technique from the field of geostatistics that uses the distance and correlation between the sample points to find the best estimator. Kriging meets the conditions to be the Best Linear Unbiased Estimator (BLUE).
“Kriging is basically a generalization of minimum mean square error estimation taking into account spatial correlation”(Sagar, Cheng, and Agterberg 2018)
Kriging is based on the work of the South African mining engineer Danie G. Krige. Several variants of kriging exist. We discuss the simplest form, simple kriging, as well as ordinary kriging and indicator kriging in this chapter.
In general, the estimation of the attribute value \(Z^*(\mathbf{x_0})\) at the interpolation location \(\mathbf{x_0}\) is calculated as the weighted mean of the surrounding points. The weights are based on the variogram model and thus on the correlation between data points.
This can be demonstrated with the ordinary kriging estimator, for example:

\[
Z^*(x_0) = \sum_{i=0}^{n} \lambda_i \, z(x_i)
\]

where \(n \leq N\) is a subset of points from the observations closest to the interpolation location, \(z(x_i)\) is the attribute value of the i-th observation, and \(\lambda_i\) is the weight determined by solving an equation system of the covariances (see next section; Sagar, Cheng, and Agterberg 2018).
Simple Kriging
The weights for the kriging are calculated via the covariances for each interpolation point, by solving the linear system

\[
\mathbf{C_h} \cdot \lambda = \mathbf{C_0}
\]

where \(\mathbf{C_h}\) contains the pairwise covariances between the support points and \(\mathbf{C_0}\) the covariances between the support points and the interpolation point.
The estimator for \(\mathbf{x_0}\) is calculated like:
\[\begin{align*}
Z^*_{SK}(x_0) &= \sum_{i=0}^{n} \lambda_i \cdot (z(x_i) - m) + m \\
\\
\Leftrightarrow \ \
Z^*_{SK}(x_0) &= \lambda^{T} \cdot (\mathbf{Z} - m) + m
\end{align*}\]
The last line is the matrix notation that can be used in Python for a faster implementation of the operations. Note that \(m = m(x) = \text{const}\) is the mean of \(Z\) in the study area.
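The matrix notation can be illustrated with a minimal simple-kriging sketch. Everything here is a hypothetical 1-D toy setup for illustration (the exponential covariance, the observation locations, and the mean are assumptions, not the model fitted in this report):

```python
import numpy as np

x_obs = np.array([0.0, 1.0, 3.0])  # observation locations (hypothetical)
z_obs = np.array([2.0, 3.0, 1.0])  # observed values (hypothetical)
m = z_obs.mean()                   # known global mean (here simply estimated)
x0 = 2.0                           # interpolation location


def cov(h):
    # Assumed covariance model, purely for illustration
    return np.exp(-np.abs(h))


# Ch: pairwise covariances between observations, C0: covariances to x0
Ch = cov(x_obs[:, None] - x_obs[None, :])
C0 = cov(x_obs - x0)

# Solve Ch @ lam = C0 for the weights, then apply the SK estimator
lam = np.linalg.solve(Ch, C0)
z_star = lam @ (z_obs - m) + m
print(z_star)
```

The two lines `np.linalg.solve` and `lam @ (z_obs - m) + m` correspond directly to the weight system and the matrix form of the estimator above.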
Ordinary Kriging
Ordinary kriging fulfills the constraint
\[
\sum_{i=0}^{n} \lambda_i \overset{!}{=} 1
\]
which leads to the estimation of the expected value in each local neighborhood and ensures the unbiasedness of the estimator.
The estimator is then

\[
Z^*_{OK}(x_0) = \lambda^{T} \cdot \mathbf{Z}
\]

with \(\mathbf{Z}\) being the vector of all observed values and \(\lambda^{T}\) the corresponding weights, both resulting in one scalar value \(Z^*_{OK}(x_0)\).
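In practice the unbiasedness constraint is enforced by augmenting the covariance system with a row and column of ones; the extra unknown is the Lagrange multiplier \(\mu\). A minimal sketch on hypothetical 1-D data (the exponential covariance and all values are illustrative assumptions):

```python
import numpy as np

x_obs = np.array([0.0, 1.0, 3.0])  # observation locations (hypothetical)
z_obs = np.array([2.0, 3.0, 1.0])  # observed values (hypothetical)
x0 = 2.0                           # interpolation location


def cov(h):
    # Assumed covariance model, for illustration only
    return np.exp(-np.abs(h))


Ch = cov(x_obs[:, None] - x_obs[None, :])
C0 = cov(x_obs - x0)

# Augment the system: last row/column of ones enforces sum(lam) = 1
n = len(x_obs)
A = np.ones((n + 1, n + 1))
A[:n, :n] = Ch
A[n, n] = 0.0
b = np.append(C0, 1.0)

sol = np.linalg.solve(A, b)
lam, mu = sol[:n], sol[n]
print(lam.sum())    # 1.0 (up to floating point): the unbiasedness constraint
print(lam @ z_obs)  # ordinary-kriging estimate at x0
```

This padding trick is the same construction the implementation below uses via `C0_constrained` and `Ch_constrained`.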
Indicator Kriging
Indicator kriging is a helpful tool to quantify the spatial probability of the occurrence of a state of the given data points. In our case, the probability of the exceedance of an arsenic threshold is calculated. The basis of indicator kriging is the division of the data into two states, 0 and 1:

\[
I(z(x_i), k_z) =
\begin{cases}
1 & \text{if } z(x_i) \geq k_z \\
0 & \text{otherwise}
\end{cases}
\]
Because we use the indicator-transformed data \(I(z(x_i), k_z)\) in the kriging function, the result \(\mathbf{P^*_{OK}}\) has a value range of \([0,1]\). For every interpolation point \(x_0\), the probability \(P^*_{OK}(x_0)\) of the threshold exceedance is calculated from the local neighborhood.
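The indicator transform itself is a one-liner; this sketch uses hypothetical arsenic values (1 codes exceedance, matching the encoding used later in this report):

```python
import numpy as np

z = np.array([3.0, 12.0, 50.0, 8.0])  # hypothetical arsenic values in ug/l
k = 10.0                              # threshold

# 1 where the value reaches or exceeds the threshold, else 0
indicator = (z >= k).astype(int)
print(indicator)  # [0 1 1 0]
```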
Geostatistical Simulation
Kriging interpolation delivers the best possible estimate for the locations with no data samples from observation. Nevertheless, the technique is not able to reproduce the same variability as in the original data. The interpolated surface \(Z^*\) is smoothed because kriging finds a least-squares solution for the interpolated points. The stochastic properties therefore differ from the original data: the variability is underestimated.
This can be seen from the variogram: the spatial variability of \(Z^*\) is lower than the spatial variability of the observations of \(Z\). Note that the histogram is also different, but displays only the data variability without the spatial context.
Comparison of semivariance for original and kriged data
For applications that require an accurate variance estimate of the kriged field \(Z^*\), stochastic simulations are used to estimate the variance of the interpolation.
Relevant for this report are the sequential indicator simulation (SIS) and the closely related sequential Gaussian simulation (SGS). Both simulations are based on the generation of artificial realizations of the state of \(Z\); often the number of realizations is chosen as 50-100 (Benndorf and Wiesbaden 2023). This is accomplished with a random path through the interpolation grid, the kriging of the estimated values, and the use of the known probability distribution of the data.
Sequential Gaussian simulation is used to simulate the variance of continuous attribute fields, while sequential indicator simulation is used for categorical data.
We use the indicator simulation in this report.
Sequential Indicator Simulation
As already described, the indicator simulation is based on a random path through all grid points of our interpolation grid. Furthermore, a so-called database is necessary. This database contains all values from our original observations. With each step through the grid, the interpolated value is appended to this database, which makes the simulation computationally costly for larger interpolation areas. The algorithm can be described as follows:
1. Set a random path through the interpolation grid.
2. Access each grid point and compute kriging in the local neighborhood. This results in the probability of \(Z \geq k\), i.e. the occurrence of the category.
3. Generate a random value from the uniform distribution \(U[0,1]\).
4. If the random value is smaller than the probability for \(Z \geq k\), set the grid point to 1, meaning it belongs to the category of the threshold \(Z \geq k\). Otherwise, the grid point does not belong to the category and is set to 0.
5. Add the result from step 4 to the database, which therefore grows by one element.
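Steps 3 and 4 can be sketched as a single drawing function (a minimal sketch; `draw_indicator` is a hypothetical helper for illustration, not part of the project code):

```python
import numpy as np

rng = np.random.default_rng(seed=42)


def draw_indicator(p_exceed: float) -> int:
    r = rng.uniform()        # step 3: random value from U[0, 1)
    return int(r < p_exceed)  # step 4: 1 if the draw falls below the kriged probability


# With p = 1.0 the category is always assigned, with p = 0.0 never
print(draw_indicator(1.0))  # 1
print(draw_indicator(0.0))  # 0
```

The same comparison appears in the implementation below as `kriged_current > r[i]`.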
All our functions interface with geopandas, especially the shapely geometries. Results are wrapped as xarray.DataArray objects, which are also used for intermediate calculations. xarray helps us keep track of coordinate information and dimension order, which results in more readable code. Further, we make use of CUDA-accelerated functions provided by cupy for the computationally intensive parts of the code. This makes the code a little more complex, but also much faster. An often-found pattern in our code is:
```python
if cp is not None:
    # use cupy
    ...
else:
    # use numpy
    ...
```
cp is a reference to the cupy module if that module is present on the system. Otherwise, cp is None, because of the way we import cupy (and related libraries):
```python
try:
    import cupy as cp
    ...  # import other cupy related libraries
except ImportError:
    cp = None
```
Overview of the variable naming (we tried to match the literature as closely as possible):
h: lag distance
z: dataset values
u: unknown, to be interpolated points
c: sill
c0: nugget
c1: sill minus nugget
a: range of the model
C0: Covariances between unknown points and known points
Ch: Covariances between known points
Below, the main functions of our project are implemented:
fastdist: A fast version of pairwise distance calculation using cupy if available.
empirical_semivariogram: Calculate the empirical semivariogram of a dataset.
spherical: A spherical model, later used for fitting a theoretical semivariogram.
spherical_fast: A reduced spherical model, which does not calculate the sill.
cov_from_semivar: Calculate the covariance from a semivariogram model.
ordinary_kriging: Ordinary Kriging interpolation.
ordinary_kriging_simulation: Sequential (Indicator) Simulation with Ordinary Kriging.
About spherical_fast: This function only calculates the spherical part of the model, hence producing wrong results if the lag h is larger than the range of the model a. However, since we should process only data pairs below the range, we can safely use this function for the kriging part. It is a lot faster than the full spherical model and uses less memory.
```python
def fastdist(a: gpd.GeoSeries, b: gpd.GeoSeries, to_host: bool = False) -> xr.DataArray:
    """Calculate the pairwise distances between two sets of geometries.

    Utilizes cupy if available for GPU acceleration.

    Args:
        a (gpd.GeoSeries): The first set of geometries.
        b (gpd.GeoSeries): The second set of geometries.
        to_host (bool, optional): Whether to return the result on the host. Defaults to False.

    Returns:
        xr.DataArray: The pairwise distances between the two sets of geometries.
    """
    if cp is not None:
        a_arr = cp.array([a.x, a.y], order="F").T.reshape(-1, 2)
        b_arr = cp.array([b.x, b.y], order="F").T.reshape(-1, 2)
        d = cupyx.scipy.spatial.distance_matrix(a_arr, b_arr)
        if to_host:
            d = d.get()
    else:
        a_arr = np.array([a.x, a.y], order="F").T.reshape(-1, 2)
        b_arr = np.array([b.x, b.y], order="F").T.reshape(-1, 2)
        d = scipy.spatial.distance_matrix(a_arr, b_arr)
    return xr.DataArray(d, dims=("a", "b"), coords={"a": a.index, "b": b.index})
```
```python
def empirical_semivariogram(
    z: gpd.GeoSeries, geom: gpd.GeoSeries, h: ArrayLike | None = None
) -> xr.DataArray:
    """Calculate the empirical semivariogram.

    Args:
        z (gpd.GeoSeries): The data to calculate the semivariogram for.
        geom (gpd.GeoSeries): The geometries of the data. Must be points.
        h (list[float] | None): The bins to calculate the semivariogram for.
            If None, the bins are estimated by linspacing 0 - distances.median()
            with 40 bins total. Defaults to None.

    Returns:
        xr.DataArray: The empirical semivariogram.
    """
    # Calculate the pairwise distances
    d = fastdist(geom, geom, to_host=True).rename({"a": "tail", "b": "head"})
    # Remove upper triangle
    d = d.where(np.triu(np.ones(d.shape), k=1).astype(bool))
    # Remove self-distances
    d = d.where(~np.eye(len(z), dtype=bool))
    # If h is not provided, estimate the lags based on the median distance
    if h is None:
        h = np.linspace(0, d.median(), 40)
    # Calculate the pairwise data differences
    z_head = xr.DataArray(z, coords={"head": z.index}, name="As").expand_dims({"tail": z.index})
    z_tail = xr.DataArray(z, coords={"tail": z.index}, name="As").expand_dims({"head": z.index})
    z_diff = z_head - z_tail
    # Calculate the semivariogram
    gamma = xr.DataArray(0.0, dims=["lag"], coords={"lag": h[1:]})
    for i in range(1, len(h)):
        is_in_lag = (h[i - 1] < d) & (d <= h[i])
        n_h = is_in_lag.sum(dim=["tail", "head"])
        gamma.loc[{"lag": h[i]}] = 1 / (2 * n_h) * (z_diff.where(is_in_lag) ** 2).sum(["tail", "head"])
    # This vectorized approach is taking too much memory, but could potentially be faster:
    # d_lags = xr.DataArray(np.digitize(d, h, right=True), coords={"tail": z.index, "head": z.index})
    # is_in_lag = xr.DataArray(
    #     [d_lags == i for i in range(1, len(h))],
    #     coords={"lag": h[1:], "tail": z.index, "head": z.index}
    # )
    # n_h = is_in_lag.sum(dim=["tail", "head"])
    # gamma = 1 / (2 * n_h) * (z_diff.where(is_in_lag) ** 2).sum(["tail", "head"])
    return gamma
```
```python
def spherical(h: ArrayLike, c0: float, c: float, a: float) -> ArrayLike:
    """Calculate the theoretical semivariogram from a spherical model.

    Args:
        h (ArrayLike): The lags to calculate the semivariogram for.
        c0 (float): The nugget effect.
        c (float): The partial sill (sill minus the nugget).
        a (float): The range.

    Returns:
        ArrayLike: The spherical semivariogram model.
    """
    semivar = np.zeros_like(h)
    curve_mask = (0 < h) & (h <= a)
    line_mask = h > a
    semivar[curve_mask] = c0 + c * (3 * h[curve_mask] / (2 * a) - 0.5 * (h[curve_mask] / a) ** 3)
    semivar[line_mask] = c0 + c
    return semivar


def spherical_fast(h: ArrayLike, c0: float, c: float, a: float) -> ArrayLike:
    """Calculate the theoretical semivariogram from a spherical model without satisfying constraints.

    Should not be used for fitting! Should not be used for h larger than a!

    Args:
        h (ArrayLike): The lags to calculate the semivariogram for.
        c0 (float): The nugget effect.
        c (float): The partial sill (sill minus the nugget).
        a (float): The range.

    Returns:
        ArrayLike: The spherical semivariogram model.
    """
    return c0 * (h != 0) + c * (3 * h / (2 * a) - 0.5 * (h / a) ** 3)


def cov_from_semivar(semivar: ArrayLike, c0: float, c: float) -> ArrayLike:
    """Calculate the covariance from a semivariogram model.

    Args:
        semivar (ArrayLike): The semivariogram model.
        c0 (float): The nugget effect.
        c (float): The partial sill (sill minus the nugget).

    Returns:
        ArrayLike: The covariance model.
    """
    return c + c0 - semivar
```
```python
# ruff: noqa: N803, N806
def ordinary_kriging(
    u: gpd.GeoSeries,
    z: gpd.GeoSeries,
    geom: gpd.GeoSeries,
    model_fn: Callable,
    c0: float,
    c: float,
    a: float,
    neighbor_factor: float = 1,
) -> tuple[xr.DataArray, xr.DataArray, xr.DataArray]:
    """Interpolate points u with ordinary kriging based on a fitted semi-variogram model and support points z.

    Args:
        u (gpd.GeoSeries): The points to interpolate ("nodes").
        z (gpd.GeoSeries): The values of the support points.
        geom (gpd.GeoSeries): The geometries of the support points.
        model_fn (Callable): The model function to use for the semi-variogram.
        c0 (float): The nugget effect.
        c (float): The partial sill (sill minus the nugget).
        a (float): The range.
        neighbor_factor (float, optional): The factor to determine the neighborhood range. Defaults to 1.

    Returns:
        tuple[xr.DataArray, xr.DataArray, xr.DataArray]: A tuple containing:
            - The interpolated, kriged values.
            - The kriging variance.
            - The mu value, which is part of the ordinarity constraint.
    """
    # Calculate the covariance matrix C(0) between the unknown points (nodes) and the support points
    d0 = fastdist(geom, u)
    C0 = cov_from_semivar(model_fn(d0, c0, c, a), c0, c)
    # Filter out distances which are larger than the range a, since we don't want to use them
    C0 = C0.where(d0 <= a * neighbor_factor, 0)
    # Rename the dimensions to match our naming scheme for the rest of the function
    C0 = C0.rename({"a": "support", "b": "nodes"})
    # Calculate the pairwise covariance matrix C(h) between the support points
    dh = fastdist(geom, geom)
    Ch = cov_from_semivar(model_fn(dh, c0, c, a), c0, c).rename({"a": "head", "b": "tail"})
    # Apply the constraint that the lambdas must sum up to 1
    C0_constrained = C0.pad({"support": (0, 1)}, constant_values=1)
    Ch_constrained = Ch.pad({"tail": (0, 1), "head": (0, 1)}, constant_values=1)
    Ch_constrained[{"tail": -1, "head": -1}] = 0
    # Calculate the weights on cuda if available, else via numpy
    if cp is not None:
        C0_constrained = C0_constrained.cupy.as_cupy()
        Ch_constrained = Ch_constrained.cupy.as_cupy()
        lam = cp.linalg.solve(
            Ch_constrained.data,
            C0_constrained.data,
        )
    else:
        lam = np.linalg.solve(
            Ch_constrained.data,
            C0_constrained.data,
        )
    lam = xr.DataArray(lam, coords=C0_constrained.coords, name="lambda")
    # Extract the last lambda as the mu, which is part of the ordinarity constraint
    lam, mu = lam[:-1], lam[-1]
    mu = mu.rename("mu")
    # Convert the support values to a DataArray
    z = xr.DataArray(z, coords={"support": z.index})
    if cp is not None:
        z = z.cupy.as_cupy()
    # Calculate the kriged values
    kriged = xr.dot(lam, z, dims="support").rename("kriged")
    # Error calculation
    kriged_var = (c + c0) - xr.dot(lam, C0, dims="support")
    if cp is not None:
        kriged = kriged.cupy.as_numpy()
        kriged_var = kriged_var.cupy.as_numpy()
        mu = mu.cupy.as_numpy()
        cp.get_default_memory_pool().free_all_blocks()
    return kriged, kriged_var, mu
```
```python
# ruff: noqa: N803, N806
def ordinary_kriging_simulation(
    u: gpd.GeoDataFrame,
    z: gpd.GeoSeries,
    geom: gpd.GeoSeries,
    model_fn: Callable,
    c0: float,
    c: float,
    a: float,
    neighbor_factor: float = 0.2,
    nsim: int = 1,
) -> xr.DataArray | tuple[xr.DataArray, xr.DataArray]:
    """Simulate points u with ordinary kriging based on a fitted semi-variogram model and support points z.

    Expects z to be indicators.

    Args:
        u (gpd.GeoDataFrame): The points to interpolate ("nodes").
        z (gpd.GeoSeries): The values of the support points.
        geom (gpd.GeoSeries): The geometries of the support points.
        model_fn (Callable): The model function to use for the semi-variogram.
        c0 (float): The nugget effect.
        c (float): The partial sill (sill minus the nugget).
        a (float): The range.
        neighbor_factor (float, optional): The factor to determine the neighborhood range. Defaults to 0.2.
        nsim (int, optional): The number of simulations to run. Defaults to 1.

    Returns:
        xr.DataArray: The simulated values if nsim is 1, else a tuple containing
            the mean of the simulated values and their variance.
    """
    # Shuffle unknown points to get a random path
    orig_index = u.index
    u = u.sample(frac=1)
    # Precalculate the pairwise distances and their respective covariance matrices by putting together
    # all geometries. In the loop we can then take slices of that large matrix.
    # With this we avoid recalculation and concatenation of the covariance matrices in the loop.
    all_geoms = pd.concat([geom, u.set_index(u.index + geom.index.max()).geometry])
    d_all = fastdist(all_geoms, all_geoms)
    d_all = d_all.rename({"a": "head", "b": "support"})
    neighbor_range = a * neighbor_factor
    neighbors = (d_all <= neighbor_range).as_numpy()
    C_all = cov_from_semivar(model_fn(d_all, c0, c, a), c0, c)
    C_all = C_all.where(d_all <= a, 0)
    simulated = xr.DataArray(
        0, dims=["simidx", "support"], coords={"simidx": range(nsim), "support": orig_index}
    )
    # Create a progress bar if we simulate more than one realization
    sim_iter = range(nsim)
    if nsim > 1:
        sim_iter = track(sim_iter, description="Simulating", total=nsim)
    for simidx in sim_iter:
        # Turn z into an xarray DataArray with space for the unknown points.
        # This array will also contain the simulated values.
        z_all = xr.DataArray(0, dims=["support"], coords={"support": all_geoms.index}, name="z")
        z_all.loc[{"support": z.index}] = z
        # Move z to cuda if available
        if cp is not None:
            z_all = z_all.cupy.as_cupy()
        # Create an indicator which points are already in the dataset.
        # We calculate this ahead of time to increase the speed of the function.
        dataset_indicator = xr.DataArray(
            False, dims=["support"], coords={"support": all_geoms.index}, name="dataset_indicator"
        )
        dataset_indicator.loc[{"support": geom.index}] = True
        # Create random generator and random uniform values ahead of time, again to increase speed
        rng = np.random.default_rng()
        r = rng.uniform(size=len(u))
        # i: Iteration counter -> The number of already interpolated points
        # uidx: The index of the unknown point
        # ui: The unknown point
        for i, (uidx, ui) in enumerate(u.iterrows()):
            # Take slices based on the current iteration. Our C0 and Ch grow with each iteration,
            # because we add more points to our dataset.
            # C0 grows linearly and Ch grows quadratically.
            uidx_adj = uidx + geom.index.max()
            compute_idx = (neighbors.sel(head=uidx_adj) & dataset_indicator).data
            C0 = C_all.sel(head=uidx_adj)[{"support": compute_idx}]
            Ch = C_all[{"support": compute_idx, "head": compute_idx}]
            # Apply the constraint that the lambdas must sum up to 1
            C0_constrained = C0.pad({"support": (0, 1)}, constant_values=1)
            Ch_constrained = Ch.pad({"support": (0, 1), "head": (0, 1)}, constant_values=1)
            Ch_constrained[{"support": -1, "head": -1}] = 0
            # Calculate the weights on cuda if available, else via numpy
            if cp is not None:
                C0_constrained = C0_constrained.cupy.as_cupy()
                Ch_constrained = Ch_constrained.cupy.as_cupy()
                lam = cp.linalg.solve(
                    Ch_constrained.data,
                    C0_constrained.data,
                )
            else:
                lam = np.linalg.solve(
                    Ch_constrained.data,
                    C0_constrained.data,
                )
            lam = xr.DataArray(lam, coords=C0_constrained.coords, name="lambda")
            # Extract the last lambda as the mu, which is part of the ordinarity constraint
            lam, _ = lam[:-1], lam[-1]
            # Calculate the kriged values
            kriged_current = z_all[{"support": compute_idx}] @ lam
            # Turn kriged values into a simulated value and add it to the dataset
            z_sim = (kriged_current > r[i]).astype(int)
            z_all.loc[{"support": uidx_adj}] = z_sim
            dataset_indicator.loc[{"support": uidx_adj}] = True
            # Free memory if cuda is used
            if cp is not None:
                del lam, kriged_current, z_sim
                cp.get_default_memory_pool().free_all_blocks()
        # Move everything back to numpy if cuda was used and free memory
        if cp is not None:
            z_all = z_all.cupy.as_numpy()
            cp.get_default_memory_pool().free_all_blocks()
        # Reindex the adjusted index to the original index to extract the simulated values in original order
        z_all.coords["support"] = z_all.coords["support"] - geom.index.max()
        simulated.loc[{"simidx": simidx}] = z_all.sel(support=orig_index)
    if nsim == 1:
        return simulated.sel(simidx=0)
    return simulated.mean("simidx"), simulated.var("simidx")
```
Results
Variogram
The empirical semi-variogram (Figure 3) shows a gradual increase in semivariance with lag distance, reaching a sill at approximately 14,500. The fitted spherical model closely follows the empirical data. This indicates a well-defined spatial structure. A small nugget effect is also visible. The covariance function decreases with distance, further confirming the spatial dependence pattern. These results indicate that the dataset exhibits strong spatial correlation up to the defined range, making it suitable for geostatistical interpolation methods like kriging.
```python
# Calculate gamma
gamma = empirical_semivariogram(waterquality["As"], waterquality.geometry)
# Fit the spherical model to the empirical semivariogram
(c0, c, a), cov = curve_fit(
    spherical, gamma.coords["lag"], gamma, p0=[0, gamma.mean(), gamma.coords["lag"].mean()]
)
print(f"Model parameters: nugget(c0)={c0:.2f}, sill(c)={c:.2f}, range(a)={a:.2f}")
```
Model parameters: nugget(c0)=8471.19, sill(c)=6442.55, range(a)=87611.10
Figure 3: Variogram of arsenic concentration in Bangladesh
We can now examine the indicator-coded variograms for the arsenic thresholds k=10 and k=50. The empirical semi-variograms have been computed for both thresholds. Compared to the previous variogram, the plots in the Figure 4 show a lower overall semivariance due to the binary transformation of the data, which captures exceedance probabilities rather than absolute concentrations. Both variograms show a clear spatial structure, with the empirical semivariograms (black dots) closely following the theoretical spherical model (orange line). For both thresholds, the fast spherical variogram also fits the empirical variogram up to the range of the model. Since, as described in the programming section, we should only process data pairs below the range, we can safely use this function for the kriging part.
```python
# Indicator encode As values for k = 10 and k = 50
waterquality["As_ind:10"] = (waterquality["As"] >= 10).astype(int)
waterquality["As_ind:50"] = (waterquality["As"] >= 50).astype(int)
```
```python
# Calculate gamma for both thresholds
gamma_10 = empirical_semivariogram(waterquality["As_ind:10"], waterquality.geometry)
gamma_50 = empirical_semivariogram(waterquality["As_ind:50"], waterquality.geometry)
# Fit the spherical model to the empirical semivariograms
(c0_10, c_10, a_10), cov_10 = curve_fit(
    spherical, gamma_10.coords["lag"], gamma_10, p0=[0, gamma_10.mean(), gamma_10.coords["lag"].mean()]
)
(c0_50, c_50, a_50), cov_50 = curve_fit(
    spherical, gamma_50.coords["lag"], gamma_50, p0=[0, gamma_50.mean(), gamma_50.coords["lag"].mean()]
)
print(f"Model parameters for k=10: nugget(c0)={c0_10:.2f}, sill(c)={c_10:.2f}, range(a)={a_10:.2f}")
print(f"Model parameters for k=50: nugget(c0)={c0_50:.2f}, sill(c)={c_50:.2f}, range(a)={a_50:.2f}")
```
Model parameters for k=10: nugget(c0)=0.14, sill(c)=0.10, range(a)=103023.87
Model parameters for k=50: nugget(c0)=0.11, sill(c)=0.08, range(a)=130991.17
Figure 4: Variogram of indicator coded arsenic concentration in Bangladesh
Kriging
For kriging, a regular grid with a resolution of 2000 meters was created to cover the area of interest. The grid points were generated using a mesh grid that spanned the bounding box of the survey area. After generation, the grid was filtered to retain only points within the defined survey area, ensuring that interpolation was performed only in relevant regions.
# Create a grid, spanning the area of interest
res = 2000  # m
x = np.arange(surveyarea.bounds.minx.item(), surveyarea.bounds.maxx.item(), res)
y = np.arange(surveyarea.bounds.miny.item(), surveyarea.bounds.maxy.item(), res)
X, Y = np.meshgrid(x, y)
grid_gdf = gpd.GeoDataFrame(geometry=gpd.points_from_xy(X.ravel(), Y.ravel(), crs=surveyarea.crs))

# Filter out points that are not within the survey area
grid_gdf = grid_gdf[grid_gdf.within(surveyarea.loc[0, "geometry"])]
# Calculate the kriged values for the grid
grid_gdf["kriged"], grid_gdf["kriged_var"], grid_gdf["mu"] = ordinary_kriging(
    grid_gdf.geometry, waterquality["As"], waterquality.geometry, spherical_fast, c0, c, a
)
Ordinary kriging
Using ordinary kriging, we generated a continuous surface of predicted arsenic concentrations based on observed sample points. The interpolated map (Figure 5) highlights elevated arsenic levels in southern and central regions, with concentrations exceeding 250 \(\mathrm{\mu g/l}\) in localized hotspots.
In contrast, northern and northeastern areas have significantly lower arsenic levels. The kriging variance layer shows the spatial uncertainty of the interpolation. Lower variance is observed in central Bangladesh, where data density is higher, indicating greater confidence in the predictions. In contrast, higher variance is observed near the borders and coastal areas where data availability is lower.
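The `ordinary_kriging` helper used above is implemented earlier in the notebook (it additionally returns the Lagrange multiplier `mu`). As a minimal, self-contained sketch of the technique, solving the ordinary kriging system with the unbiasedness constraint for each prediction point, one could write:

```python
import numpy as np
from scipy.spatial.distance import cdist


def spherical(h, c0, c, a):
    h = np.asarray(h, dtype=float)
    g = c0 + c * (1.5 * h / a - 0.5 * (h / a) ** 3)
    return np.where(h <= a, g, c0 + c)


def ordinary_kriging_sketch(xy_pred, values, xy_obs, model, *args):
    """Minimal ordinary kriging: returns predictions and kriging variances."""
    n = len(values)
    # Semivariance matrix among observations, augmented with the
    # Lagrange row/column enforcing sum(weights) == 1
    G = np.ones((n + 1, n + 1))
    G[:n, :n] = model(cdist(xy_obs, xy_obs), *args)
    G[n, n] = 0.0
    g0 = model(cdist(xy_pred, xy_obs), *args)  # semivariances to prediction points
    preds = np.empty(len(xy_pred))
    variances = np.empty(len(xy_pred))
    for i in range(len(xy_pred)):
        b = np.append(g0[i], 1.0)  # right-hand side incl. unbiasedness constraint
        w = np.linalg.solve(G, b)
        preds[i] = w[:n] @ values
        variances[i] = w @ b  # sigma^2 = sum(w_i * gamma_i0) + mu
    return preds, variances


obs = np.array([[0.0, 0.0], [10.0, 0.0]])
vals = np.array([1.0, 3.0])
preds, variances = ordinary_kriging_sketch(
    np.array([[0.0, 0.0], [5.0, 0.0]]), vals, obs, spherical, 0.0, 1.0, 20.0
)
```

With a zero nugget, kriging is an exact interpolator: predicting at an observation location reproduces the observed value with zero kriging variance, while points between observations get positive variance.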
Figure 5: Kriged arsenic concentration in Bangladesh
Indicator kriging
Looking at the resulting maps of the indicator kriging approach for the 10 \(\mu g/ l\) threshold (Figure 6), red areas indicate a low probability of arsenic levels remaining below the threshold, meaning that these regions are more likely to experience higher contamination. These high-risk areas are predominantly concentrated in central and southeastern Bangladesh, with the highest risks found in and around the capital Dhaka, consistent with previously identified contamination hotspots. In contrast, blue areas represent a high probability of arsenic concentrations remaining below 10 \(\mu g/ l\), indicating safer conditions, particularly in the northern and southwestern regions. However, comparing the two maps shows that the 10 \(\mu g/ l\) risk map predicts a high probability of exceeding 10 \(\mu g/ l\) in large areas of the central and northern regions, some of which were considered safe by the direct ordinary kriging approach.
Figure 6: Indicator Kriged arsenic concentration < 10 in Bangladesh
In contrast, the 50 \(\mu g/ l\) map shows a more localized distribution of red zones, indicating that fewer areas are above this higher threshold. The regions with a high probability of arsenic concentrations below the threshold are more extensive on the 50 \(\mu g/ l\) map than on the 10 \(\mu g/ l\) map. This suggests that while fewer locations meet the more stringent 10 \(\mu g/ l\) criterion, more areas remain below 50 \(\mu g/ l\).
Figure 7: Indicator Kriged arsenic concentration < 50 in Bangladesh
As shown, kriging results in a very smooth interpolation of the grid points. Hence, the variance of the kriged data is much lower than the variance of the original data:
var_orig_data_10 = waterquality["As_ind:10"].var()
var_orig_data_50 = waterquality["As_ind:50"].var()
var_kriged_data_10 = grid_gdf["kriged_10"].var()
var_kriged_data_50 = grid_gdf["kriged_50"].var()
print(f"Original variance for k=10: {var_orig_data_10:.2f} vs. indicator kriged variance: {var_kriged_data_10:.2f}")
print(f"Original variance for k=50: {var_orig_data_50:.2f} vs. indicator kriged variance: {var_kriged_data_50:.2f}")
Original variance for k=10: 0.24 vs. indicator kriged variance: 0.07
Original variance for k=50: 0.19 vs. indicator kriged variance: 0.06
Sequential Indicator Simulation
The simulation uses a regular grid with a resolution of 10000 meters to cover the area of interest. This keeps the computation time manageable on limited hardware while still relying on our own Python implementation of the algorithm. For the same reason, we performed only 20 simulations instead of 100.
We tried to utilize every computational trick known to us to speed up the simulation process. Compared to straightforward implementations found online, we achieved a speedup of approximately 100-1000x. This is, however, still significantly slower than the C implementation used in the R package gstat. We cannot explain why the C implementation is so much faster, but we suspect it takes some shortcuts in the simulation process, e.g. by only simulating a fraction of the points and randomly filling the holes. With our implementation, we ran into limitations regarding memory consumption: the simulation involves many distance calculations, both (a) to find the local neighborhood of each point and (b) to compute the covariance between all neighbors of a point. To avoid recomputing the same distances and covariances, we precalculate all distances and neighborhood covariances. Since the grid points are appended to the original dataset, the number of distances to calculate is (n_dataset + n_grid)**2, compared to normal kriging where it is just n_dataset**2 + n_dataset * n_grid. For a fine grid this takes up a lot of memory, more than our systems can handle. Without the precalculation, the simulation would be about 10x slower, which is not acceptable. We also tried sparse matrices, but could not gain a significant speedup there either. One thing we have not tried yet is using KD-trees for the distance calculation and neighborhood search, but we would probably run into similar memory issues.
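To make the memory argument concrete, here is a back-of-the-envelope calculation with illustrative (hypothetical) point counts, not the notebook's actual dataset sizes:

```python
# Rough memory estimate for float64 pairwise-distance matrices
n_dataset, n_grid = 5_000, 50_000  # illustrative sizes only

# All-vs-all after appending the grid to the dataset (simulation)
pairs_sis = (n_dataset + n_grid) ** 2
# Obs-obs plus obs-grid distances only (ordinary kriging)
pairs_kriging = n_dataset**2 + n_dataset * n_grid

bytes_per = 8  # float64
gib_sis = pairs_sis * bytes_per / 2**30
gib_kriging = pairs_kriging * bytes_per / 2**30
print(f"SIS: {gib_sis:.1f} GiB vs. kriging: {gib_kriging:.1f} GiB")
```

Even at these modest sizes the simulation's distance matrix is an order of magnitude larger than the one needed for plain kriging, which is exactly the bottleneck described above.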
Using CUDA already sped up the calculation of the gamma values used in the kriging process. This was an important step for the (non-simulated) kriging algorithm: the gamma values were already computed in a vectorized way, and moving that computation to the GPU increased the speed by a factor of 100. Because the simulation itself cannot be vectorized, the speedup there is less dramatic, but still noticeable. The memory problem also comes into play here, because the GPU has less memory than the CPU.
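A common pattern for this optional GPU acceleration (sketched here with illustrative data, not the notebook's actual gamma computation) is to select the array backend once and write the vectorized distance computation against it:

```python
import numpy as np

try:
    import cupy as cp  # CUDA-accelerated drop-in for numpy, if available
except ImportError:
    cp = None

xp = cp if cp is not None else np  # pick the backend once

coords = xp.asarray([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
# Vectorised pairwise Euclidean distances via broadcasting,
# the same pattern used for the gamma values
diff = coords[:, None, :] - coords[None, :, :]
dists = xp.sqrt((diff**2).sum(axis=-1))
if cp is not None:
    dists = cp.asnumpy(dists)  # copy results back to host memory
```

Because CuPy mirrors the NumPy API, the identical broadcasting code runs on CPU or GPU depending on which backend `xp` points to.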
As a result, we were forced to reduce the grid size significantly and to lower the number of simulations. The key takeaways, however, are still visible in the results: the mean of all simulations is somewhat similar to the ordinary kriging result, but the variance is much closer to that of the real data.
# Create a grid, spanning the area of interest
res = 10000  # m
x = np.arange(surveyarea.bounds.minx.item(), surveyarea.bounds.maxx.item(), res)
y = np.arange(surveyarea.bounds.miny.item(), surveyarea.bounds.maxy.item(), res)
X, Y = np.meshgrid(x, y)
small_grid_gdf = gpd.GeoDataFrame(geometry=gpd.points_from_xy(X.ravel(), Y.ravel(), crs=surveyarea.crs))

# Filter out points that are not within the survey area
small_grid_gdf = small_grid_gdf[small_grid_gdf.within(surveyarea.loc[0, "geometry"])]
The map of 10 µg/L mean values from 20 indicator simulations shows the probability distribution of arsenic concentrations across Bangladesh. High probability areas are concentrated in the north and southeast, while low probability areas dominate the central and eastern regions, forming clear and distinct patches.
sim_grid_10 = _sparse_to_xarray_direct(small_grid_gdf, "example_sim_10")
sim_grid_50 = _sparse_to_xarray_direct(small_grid_gdf, "example_sim_50")

m_sim = folium.Map(tiles="CartoDB positron")
waterquality.explore(m=m_sim, column="As", cmap=colorbar, vmax=vmax, name="As (µg/L)", show=False)
sim_grid_10.odc.explore(m_sim, cmap="coolwarm", name="Simulated Indicator for As < 10 ug/L")
sim_grid_50.odc.explore(m_sim, cmap="coolwarm", name="Simulated Indicator for As < 50 ug/L")
surveyarea.boundary.explore(m=m_sim, style_kwds={"fill": False, "color": "black"}, name="Survey Area")

mp_cmap = cm.get_cmap("coolwarm")
cb_colors = np.apply_along_axis(colors.to_hex, 1, mp_cmap(range(mp_cmap.N)))
colorbar_sim_10 = bc.colormap.LinearColormap(
    cb_colors,
    vmin=sim_grid_10.min().item(),
    vmax=sim_grid_10.max().item(),
    caption="Simulated Indicator for As < 10 ug/L",
).to_step(n=2)
colorbar_sim_50 = bc.colormap.LinearColormap(
    cb_colors,
    vmin=sim_grid_50.min().item(),
    vmax=sim_grid_50.max().item(),
    caption="Simulated Indicator for As < 50 ug/L",
).to_step(n=2)

m_sim.add_child(colorbar_sim_10)
m_sim.add_child(colorbar_sim_50)
folium.LayerControl().add_to(m_sim)
folium.FitOverlays().add_to(m_sim)
m_sim
Figure 8: Simulated Indicator arsenic concentration < 50 in Bangladesh
Even with this small number of simulated points, the variance now looks much more similar to that of the original data; our simulation process successfully captured the spatial variability of the indicator-coded arsenic concentrations:
var_orig_data_10 = waterquality["As_ind:10"].var()
var_orig_data_50 = waterquality["As_ind:50"].var()
var_sim_data_10 = small_grid_gdf["example_sim_10"].var()
var_sim_data_50 = small_grid_gdf["example_sim_50"].var()
print(f"Original variance for k=10: {var_orig_data_10:.2f} vs. simulated variance: {var_sim_data_10:.2f}")
print(f"Original variance for k=50: {var_orig_data_50:.2f} vs. simulated variance: {var_sim_data_50:.2f}")
Original variance for k=10: 0.24 vs. simulated variance: 0.24
Original variance for k=50: 0.19 vs. simulated variance: 0.16
Now we run the simulation 20 times and calculate the mean and variance of the simulated data. The main benefit of (and reason for) the indicator simulation is the variance we obtain for the state of each pixel. The variance from the simulations tells us in what range a value can plausibly differ from its mean. This variance follows the variability of the original (observed) data and reflects the local variability that the kriging methods smooth away. The simulation approach is therefore necessary for applications that need not just well-interpolated values but also a measure of how certain those values are, given the original data.
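The per-pixel statistics are straightforward once the realisations are stacked. A small sketch on synthetic binary realisations (the notebook's `sim_mean_10`/`sim_var_10` columns are filled analogously from the actual simulation output):

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_points = 20, 1000

# Stand-in for 20 binary indicator realisations (rows) over the grid points (columns)
sims = (rng.random((n_sims, n_points)) < 0.3).astype(float)

# Per-pixel statistics across realisations
sim_mean = sims.mean(axis=0)  # exceedance-probability estimate per pixel
sim_var = sims.var(axis=0)    # local uncertainty; zero where all realisations agree
```

For a binary indicator the per-pixel variance is at most 0.25 (reached when the realisations are split 50/50), so low-variance pixels are those where the simulations consistently agree on the indicator state.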
In the maps below we can see that in the center, but also in the north the variance of the simulated data is quite small - meaning that the indicator kriged data is quite certain at these points. This effect is stronger for the \(\mathrm{50 \mu g /l}\) threshold, probably because it covers a larger fraction of positive pixels.
sim_mean_grid_10 = _sparse_to_xarray_direct(small_grid_gdf, "sim_mean_10")
sim_var_grid_10 = _sparse_to_xarray_direct(small_grid_gdf, "sim_var_10")

m_sim = folium.Map(tiles="CartoDB positron")
waterquality.explore(m=m_sim, column="As", cmap=colorbar, vmax=vmax, name="As (µg/L)", show=False)
sim_var_grid_10.odc.explore(m_sim, cmap="GnBu", name="Variance of Simulated Indicator for As < 10 ug/L")
sim_mean_grid_10.odc.explore(m_sim, cmap="coolwarm", name="Mean of Simulated Indicator for As < 10 ug/L", show=False)
surveyarea.boundary.explore(m=m_sim, style_kwds={"fill": False, "color": "black"}, name="Survey Area")

mp_cmap = cm.get_cmap("coolwarm")
cb_colors = np.apply_along_axis(colors.to_hex, 1, mp_cmap(range(mp_cmap.N)))
colorbar_sim_mean = bc.colormap.LinearColormap(
    cb_colors,
    vmin=sim_mean_grid_10.min().item(),
    vmax=sim_mean_grid_10.max().item(),
    caption="Mean of Simulated Indicator for As < 10 ug/L",
)
mp_cmap = cm.get_cmap("GnBu")
cb_colors = np.apply_along_axis(colors.to_hex, 1, mp_cmap(range(mp_cmap.N)))
colorbar_sim_var = bc.colormap.LinearColormap(
    cb_colors,
    vmin=sim_var_grid_10.min().item(),
    vmax=sim_var_grid_10.max().item(),
    caption="Variance of Simulated Indicator for As < 10 ug/L",
)

m_sim.add_child(colorbar_sim_mean)
m_sim.add_child(colorbar_sim_var)
folium.LayerControl().add_to(m_sim)
folium.FitOverlays().add_to(m_sim)
m_sim
Figure 9: Average of 20 simulated Indicator arsenic concentration < 10 in Bangladesh
sim_mean_grid_50 = _sparse_to_xarray_direct(small_grid_gdf, "sim_mean_50")
sim_var_grid_50 = _sparse_to_xarray_direct(small_grid_gdf, "sim_var_50")

m_sim = folium.Map(tiles="CartoDB positron")
waterquality.explore(m=m_sim, column="As", cmap=colorbar, vmax=vmax, name="As (µg/L)", show=False)
sim_var_grid_50.odc.explore(m_sim, cmap="GnBu", name="Variance of Simulated Indicator for As < 50 ug/L")
sim_mean_grid_50.odc.explore(m_sim, cmap="coolwarm", name="Mean of Simulated Indicator for As < 50 ug/L", show=False)
surveyarea.boundary.explore(m=m_sim, style_kwds={"fill": False, "color": "black"}, name="Survey Area")

mp_cmap = cm.get_cmap("coolwarm")
cb_colors = np.apply_along_axis(colors.to_hex, 1, mp_cmap(range(mp_cmap.N)))
colorbar_sim_mean = bc.colormap.LinearColormap(
    cb_colors,
    vmin=sim_mean_grid_50.min().item(),
    vmax=sim_mean_grid_50.max().item(),
    caption="Mean of Simulated Indicator for As < 50 ug/L",
)
mp_cmap = cm.get_cmap("GnBu")
cb_colors = np.apply_along_axis(colors.to_hex, 1, mp_cmap(range(mp_cmap.N)))
colorbar_sim_var = bc.colormap.LinearColormap(
    cb_colors,
    vmin=sim_var_grid_50.min().item(),
    vmax=sim_var_grid_50.max().item(),
    caption="Variance of Simulated Indicator for As < 50 ug/L",
)

m_sim.add_child(colorbar_sim_mean)
m_sim.add_child(colorbar_sim_var)
folium.LayerControl().add_to(m_sim)
folium.FitOverlays().add_to(m_sim)
m_sim
Average of 20 simulated Indicator arsenic concentration < 50 in Bangladesh
Simulation vs. Kriging
Figure 11 shows the comparison between the kriging and the averaged simulation results for the 10 µg/L threshold. The kriging map displays a smooth interpolation of the grid points, while the simulation average still shows some variance. With more simulations, however, the average becomes smoother and increasingly similar to the kriging result, and would converge to it apart from the value range: the kriged values can exceed the range of the original data (0-1), while the simulation results always stay within this range.
Figure 10: Kriged Indicator arsenic concentration < 10 in Bangladesh
Figure 11: Average of 20 simulated Indicator arsenic concentration < 10 in Bangladesh
Discussion and comparison between Indicator Kriging and Ordinary Kriging
Comparing indicator kriging and ordinary kriging shows clear differences. Indicator kriging uses thresholds to estimate the probability of exceeding certain levels. For a 10 µg/L threshold (Figure 6), indicator kriging marks large high-risk areas in central and southeastern Bangladesh, matching known hotspots. It also flags some northern regions as high-risk, while ordinary kriging labels them safe. Indicator kriging captures threshold-based risk better, but it loses continuous detail and depends on the chosen thresholds. Ordinary kriging is simpler and often more accurate for normally distributed data, though it may over-smooth important boundaries.
Another benefit of indicator kriging is the ability to estimate a CDF for each grid point by running multiple krigings with different thresholds. These CDFs can be used for several purposes, e.g. more sophisticated simulations or risk assessment.
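A small sketch of this idea, with hypothetical exceedance probabilities at a single grid point (the thresholds and values are illustrative, not taken from the notebook's results):

```python
import numpy as np

# Hypothetical exceedance probabilities P(As >= k) at one grid point,
# e.g. obtained from indicator kriging at several thresholds k
thresholds = np.array([5, 10, 50, 100])    # µg/L
p_exceed = np.array([0.9, 0.7, 0.3, 0.1])  # should be non-increasing in k

# The local CDF follows directly: F(k) = P(As < k) = 1 - P(As >= k)
cdf = 1.0 - p_exceed

# Example risk query: probability that As falls between 10 and 50 µg/L
p_between = (cdf[thresholds == 50] - cdf[thresholds == 10]).item()
print(p_between)  # ≈ 0.4
```

In practice the kriged exceedance probabilities are not guaranteed to be monotone in the threshold, so a post-processing step (sorting or clipping the per-point probabilities) is usually needed before interpreting them as a CDF.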
Policy makers and law enforcement can also make better use of indicator-kriged maps to identify where policy-set thresholds are potentially exceeded, because indicator kriging provides them with a probability of exceedance instead of a hard value. Here, the problem of kriging overshoot and undershoot, which occurs when the data contain a few extreme values, matters less. These extremes can distort the interpolation of non-extreme values: in our case, the high values in the center lead to very low and even negative values a little further north, where the indicator kriging still shows a high probability of threshold exceedance.
Figure 12: Kriged Indicator arsenic concentration < 10 in Bangladesh
Figure 13: Kriged arsenic concentration in Bangladesh